Maerl Whole Genome Genotyping Analysis

1. DNA sequence data and filtering

Maerl DNA was extracted using the Qiagen Blood & Tissue Kit and sent to SNPsaurus for whole genome genotyping. Libraries were sequenced on a paired-end 2x150bp NovaSeq 5000 S4 lane. The raw reads were filtered using Fastp.

Rename all raw reads
The names of all raw reads provided began with "TJ-". This was removed using the rename command.
rename 's/TJ-//' *.fastq.gz

Print the number of raw reads per sample in parallel
This code prints the names of all fastq.gz files, calculates the number of reads per sample, and prints both the sample name and the output.
ls *fastq.gz | parallel --keep-order 'echo -n "{} "; zcat {} | grep -c "^@NGS"' > rawread_counts.txt

Clean raw reads using Fastp
First, prepare a bash script called runFastp.sh that accepts a sample of paired reads and executes Fastp.

PROC=4  
R1_FQ="$1"  
R2_FQ="$2"  
outDIR=/data2/tjenks/maerl_genomics/clean_reads  
rawDIR=/data2/tjenks/maerl_genomics/raw_reads  
fastp -i $rawDIR/${R1_FQ} -I $rawDIR/${R2_FQ} -o $outDIR/${R1_FQ%_*}.fastq.gz -O $outDIR/${R2_FQ%_*}.fastq.gz -q 20 --trim_poly_x --length_required 100 --thread ${PROC}
Parameter Description
-q 20 base phred quality >= 20
--trim_poly_x perform both polyG (enabled by default) and polyX tail trimming
--length-required 100 discard reads < 100 bp

Second, navigate to the directory containing the raw reads and run the following code. This writes another bash script which sets up the input files for the runFastp.sh script.
paste <(ls -1 *R1_001.fastq.gz) <(ls -1 *R2_001.fastq.gz) | awk '{print "bash runFastp.sh", $1, $2}' > ../clean_reads/clean_reads.sh

Finally, run the clean_reads.sh script.
bash clean_reads.sh

Print the number of high quality clean reads per sample
ls *fastq.gz | parallel --keep-order 'echo -n "{} "; zcat {} | grep -c "^@NGS"' > cleanread_counts.txt

No. of samples Total raw reads Total HQ reads % Retained
95 1,158,303,200 1,101,098,746 91.5

2. Extract organelles

Extract the mitochondrial and plastid genomes from the cleaned reads using the programs GetOrganelle and Unicycler. For a guide to analysing NGS-derived organelles see Song et al. (2016). Each organelle extracted was compared to the NCBI GenBank database using BLASTn to check they matched other maerl species.

Software
GetOrganelle 1.6.4
Bowtie2 2.4.1 (ensure that the path to Bowtie2 version 2.4.1 is specified because GetOrganelle failed with previous versions)
Unicycler 0.4.9b
SPAdes 3.14.0
BLAST+ 2.9.0
Bandage

2.1 Mitogenome

Download DNA sequence data to use as seeds

Species GenBank ID Organelle Locus Length (bp)
P. calcareum KF808323 Mitochondrion COI partial sequence 651
Lithothamnion sp. MH281621 Mitochondrion Complete genome 25,605
L. corallioides KC861502 Mitochondrion COI partial sequence 664
L. glaciale HM918977 Mitochondrion COI partial sequence 664

Extract mitogenome from one sample

First, extract target reads using GetOrganelle.
get_organelle_from_reads.py -1 read1.fq.gz -2 read2.fq.gz -o output_dir -s Lithothamnion_mitogenome.fasta,Pcalcareum_COI.fasta -F embplant_mt,embplant_mt -t 8 --which-bowtie2 /data2/tjenks/software/bowtie2-2.4.1-linux-x86_64/ -R 50 --no-spades

Parameter Description
-1, -2 forward read, reverse read
-s sequences for initial seed
-F target organelle type (embplant_mt, animal_mt, fungus_mt)
-R 50 maximum number of extension rounds
--no-spades do not assemble reads using SPAdes

Second, assemble the target reads using Unicycler.
unicycler -1 filtered_1_paired.fq -2 filtered_2_paired.fq -o unicycler_assembly --mode conservative

Lastly, view the assembled contig graph (*.gfa) in Bandage to check for circularity and BLAST the sequence on the NCBI nucleotide database to check sequence similarity.

In [1]:
# Example of a complete circular mitogenome from Tre04 sample
from IPython.display import Image
Image("Jupyter_files/Tre04_Bandage_graph.png", width=500, height=500)
Out[1]:

Extract mitogenomes for all samples using bash script

First, prepare a bash script called runGetMitogenome.sh that accepts a sample of paired reads, executes GetOrganelle, and then assembles the target reads using Unicycler.

#!/bin/sh

PROC=8
R1_FQ="$1"
R2_FQ="$2"
readDIR=/data2/tjenks/maerl_genomics/clean_reads
seedDIR=/data2/tjenks/maerl_genomics/get_organelle/seed_sequences

get_organelle_from_reads.py -1 $readDIR/${R1_FQ} -2 $readDIR/${R2_FQ} -o ${R1_FQ%_*} -s $seedDIR/Lithothamnion_mitogenome.fasta,$seedDIR/Pcalcareum_COI.fasta -F embplant_mt,embplant_mt -t ${PROC} --which-bowtie2 /data2/tjenks/software/bowtie2-2.4.1-linux-x86_64/ --prefix ${R1_FQ%_*} -R 50 --no-spades

cd ${R1_FQ%_*}

unicycler -1 ${R1_FQ%_*}filtered_1_paired.fq -2 ${R1_FQ%_*}filtered_2_paired.fq -o unicycler_assembly --mode conservative

Second, navigate to the directory containing the clean reads and run the following code which writes another bash script that sets up the input files to runGetMitogenome.sh.
paste <(ls -1 *R1.fastq.gz) <(ls -1 *R2.fastq.gz) | awk '{print "bash rungetMitogenome.sh", $1, $2}' > ../dir_path/runBash.sh

Finally, execute the runBash.sh script.
bash runBash.sh

Notes:
When the organelle genome was manually extracted from the assembly graph in Bandage (on my Windows laptop), the resulting FASTA text file was encoded in Windows. This must be changed in notepad++ to Unix (LF) for it to work with some Unix programs.

Rotate sequences to start at the same position
To rotate all sequences to start at the same position, run the program MARS (Multiple circular sequence Alignment using Refined Sequences) which takes a FASTA file of sequences as input.
mars -a DNA -i input.fasta -o rotated.fasta

2.2 Plastome

Download DNA sequence data to use as seeds

Species GenBank ID Organelle Locus Length (bp)
P. calcareum KC819266 Plastid psbA partial sequence 889
P. calcareum MH274809 Plastid rbcL partial sequence 1,322
Lithothamnion sp. MH281627 Plastid Complete genome 183,822
L. corallioides KC819265 Plastid psbA partial sequence 888
L. glaciale KC819271 Plastid psbA partial sequence 853

Extract plastome from one sample

First, extract target reads using GetOrganelle.
get_organelle_from_reads.py -1 read1.fq.gz -2 read2.fq.gz -o output_dir -s Lithothamnion_plastome.fasta,Pcalcareum_psbA_rbcL.fasta -F embplant_pt,embplant_pt -t 8 --which-bowtie2 /data2/tjenks/software/bowtie2-2.4.1-linux-x86_64/ --no-spades

Parameter Description
-1, -2 forward read, reverse read
-s sequences for initial seed
-F target organelle type (embplant_pt, animal_pt, fungus_pt)
--no-spades do not assemble reads using SPAdes

Second, assemble the target reads using Unicycler.
unicycler -1 filtered_1_paired.fq -2 filtered_2_paired.fq -o unicycler_assembly --mode conservative

Lastly, view the assembled contig graph (*.gfa) in Bandage to check for circularity and BLAST the sequence on the NCBI nucleotide database to check sequence similarity.

In [2]:
# Example of a complete circular plastome from Arm01 sample
from IPython.display import Image
Image("Jupyter_files/Arm01_Bandage_graph.png", width=500, height=500)
Out[2]:

Extract plastomes for all samples using bash script

First, prepare a bash script called runGetPlastome.sh that accepts a sample of paired reads, executes GetOrganelle, and then assembles the target reads using Unicycler.

#!/bin/sh

PROC=8
R1_FQ="$1"
R2_FQ="$2"
readDIR=/data2/tjenks/maerl_genomics/clean_reads
seedDIR=/data2/tjenks/maerl_genomics/get_organelle/seed_sequences

get_organelle_from_reads.py -1 $readDIR/${R1_FQ} -2 $readDIR/${R2_FQ} -o ${R1_FQ%_*} -s $seedDIR/Lithothamnion_plastome.fasta,$seedDIR/Pcalcareum_psbA_rbcL.fasta -F embplant_pt,embplant_pt -t ${PROC} --which-bowtie2 /data2/tjenks/software/bowtie2-2.4.1-linux-x86_64/ --prefix ${R1_FQ%_*} --no-spades

cd ${R1_FQ%_*}

unicycler -1 ${R1_FQ%_*}filtered_1_paired.fq -2 ${R1_FQ%_*}filtered_2_paired.fq -o unicycler_assembly --mode conservative

Second, navigate to the directory containing the clean reads and run the following code which writes another bash script that sets up the input files to runGetMitogenome.sh.
paste <(ls -1 *R1.fastq.gz) <(ls -1 *R2.fastq.gz) | awk '{print "bash runGetMitogenome.sh", $1, $2}' > ../dir_path/runBash.sh

Finally, execute the runBash.sh script.
bash runBash.sh

Notes:
When the organelle genome was manually extracted from the assembly graph in Bandage (on my Windows laptop), the resulting FASTA file was formatted as a Windows text file. This must be changed in notepad++ to Unix (LF) for it to work with some Unix programs.

Rotate sequences to start at the same position
To rotate all sequences to start at the same position, run the program MARS (Multiple circular sequence Alignment using Refined Sequences) which takes a FASTA file of sequences as input.
mars -a DNA -i input.fasta -o rotated.fasta

3. Annotate organelles

After extracting an organelle contig, the organelle sequence was used as input to annotation software which predicts and annotate the genes captured in the organelle assembly. For red algae mitochondria, the NCBI genetic code used for prediction and annotation appears to be transl_table4 (The Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code). For red algae plastid genomes, the NCBI genetic code used for prediction and annotation appears to be transl_table11 (The Bacterial, Archaeal and Plant Plastid Code). For a recent study on the evolutionary histories of red algal organelles see Lee et al. 2018.

The Genetic Code - Khan Academy

Software
GeSeq - annotation of organelle genomes, in particular chloroplast genomes.
MITOS - annotation of metazoan mitochondrial genomes.
MFannot - annotation of organelle genomes (helpful for genomes with many introns, outputs proteins sequences).
OGDRAW - visualisation of circular genome
GenomeVx - visualisation of circular genome

Note
After examining the output of both BLASTn and GeSeq, I noticed that some samples had the mitogenome or plastome sequence in reverse, which explained why there were large unexplanable differences in pairwise identity between two groups of samples using the ClustalW and Mafft aligners. To get around this, I took the reverse complement of one sequence 'type' (I called this type B) and then re-annotated the reversed sequences so that the gene coding sequences were in the same direction for every sample.

3.1 Mitogenome annotation

Mitogenome sequences were annotated using GeSeq. The Lithothamnion sp. GenBank record was used as a reference and the Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code was used as the genetic code for the annotation of tRNAs using ARWEN 1.2.3.

In [8]:
# Import annotation of Tre04 mitogenome
from IPython.display import Image
Image("Jupyter_files/Tre04_mitogenome_GeSeq.jpg", width = 800, height = 800)
Out[8]:

The coding sequences (CDS) output from GeSeq for each sample was imported into Geneious 6.1.8. The CDS were sorted alphabetically and concatenated to generate one sequence for each sample containing 24 CDS.

In [1]:
# Import csv file
import pandas as pd
align_stats = pd.read_csv("Jupyter_files/mitogenome_cds.csv")

# Print all rows of the data
pd.set_option('display.max_rows', None)
align_stats
Out[1]:
CDS CDS description Function Approximate length (bp)
0 ATP6 ATP synthase subunit 6 Enzyme involved in oxidative phosphorylation 751
1 ATP8 ATP synthase subunit 8 NaN 151
2 ATP9 ATP synthase subunit 9 NaN 231
3 COX1 Cytochrome C oxidase I Enzyme in electron transport chain 1,594
4 COX2 Cytochrome C oxidase II NaN 748
5 COX3 Cytochrome C oxidase III NaN 819
6 CYTB Cytochrome b Electron transport chain 1,139
7 ND1 NADH dehydrogenase subunit 1 Catalyses transfer of electrons from NADH to u... 968
8 ND2 NADH dehydrogenase subunit 2 NaN 1,494
9 ND3 NADH dehydrogenase subunit 3 NaN 303
10 ND4 NADH dehydrogenase subunit 4 NaN 1,471
11 ND4L NADH dehydrogenase subunit 4L NaN 306
12 ND5 NADH dehydrogenase subunit 5 NaN 1,923
13 ND6 NADH dehydrogenase subunit 6 NaN 590
14 orf1 NaN NaN 151
15 rpl16 16S rRNA Translation of mRNAs into mitochondrial proteins 321
16 rps3 NaN NaN 693
17 rps11 NaN NaN 344
18 rps12 12S rRNA Translation of mRNAs into mitochondrial proteins 327
19 sdh2 NaN NaN 665
20 sdh3 NaN NaN 371
21 sdh4 NaN NaN 228
22 secY NaN NaN 399
23 ymf39 NaN NaN 555

Align all mitogenome CDS

A FASTA file containing the concatenated CDS for each sample was exported and alignment was conducted in Unix using Mafft following the on-screen instructions.
module load MAFFT/7.305-foss-2018b-with-extensions
mafft

The alignment was then re-imported into Geneious and a UPGMA tree was built using Tamura-Nei genetic distances. Two alignments and trees were generated: (1) Only Phymatolithon calcareum samples, (2) All maerl samples (including Scotland Phymatolithon sp. and Lithothamnion sp.).

3.2 Plastome annotation

Plastome sequences were annotated using GeSeq. The Lithothamnion sp. GenBank record was used as a reference and The Bacterial, Archaeal and Plant Plastid Code was used for annotation and ARAGORN v1.2.38 was used to annotate tRNAs.

In [9]:
# Import annotated plastome of Tre04
from IPython.display import Image
Image("Jupyter_files/Tre04_plastome_GeSeq.jpg", width = 800, height = 800)
Out[9]:

The coding sequences (CDS) output from GeSeq for each sample was imported into Geneious 6.1.8. The CDS were sorted alphabetically and concatenated to generate one sequence for each sample containing 24 CDS.

Align all plastome CDS

A FASTA file containing the concatenated CDS for each sample was exported and alignment was conducted in Unix using Mafft following the on-screen instructions.
module load MAFFT/7.305-foss-2018b-with-extensions
mafft

The alignment was then re-imported into Geneious and a UPGMA tree was built using Tamura-Nei genetic distances. Two alignments and trees were generated: (1) Only Phymatolithon calcareum samples, (2) All maerl samples (including Scotland Phymatolithon sp. and Lithothamnion sp.).

4. Organelle analysis

The FASTA alignments were imported into R for analysis and tree building.

5. Reference genome assembly

Sequence reads were trimmed using bbduk (BBMap tools) and a de novo reference was assembled from one sample (Mor02) using abyss-pe using standard parameters. The contigs from the reference were run through BLASTn and any contigs with blast hits to bacteria or other contaminating species were removed. (This step was done by SNPsaurus).

6. SNP calling

6.1 Alignments to reference

Align reads using bwa-mem2

Index reference
bwa-mem2 index assembly_Mor02_final.fasta

Prepare a bash script that aligns clean reads to the reference using bwa-mem2

#!/bin/sh

PROC=6
R1_FQ="$1"
R2_FQ="$2"
ref=/data2/tjenks/maerl_genomics/assembly_pcal/clean_maerl_Mor02_250.fa
clean=/data2/tjenks/maerl_genomics/clean_reads

bwa-mem2 mem -t ${PROC} $ref $clean/${R1_FQ} $clean/${R2_FQ} > ${R1_FQ%_*}.sam
samtools view --threads ${PROC} -b ${R1_FQ%_*}.sam > ${R1_FQ%_*}.bam
samtools sort --threads ${PROC} ${R1_FQ%_*}.bam -o ${R1_FQ%_*}.sorted.bam
rm ${R1_FQ%_*}.sam
rm ${R1_FQ%_*}.bam

Navigate to the directory containing the clean reads and run the following command to prepare the input files
paste <(ls -1 *R1.fastq.gz) <(ls -1 *R2.fastq.gz) | awk '{print "bash x_align.sh", $1, $2}' > ../12.bwa-mem2_alignments/y_input.sh

Execute by running bash y_input.sh

Filter alignments and generate stats

Count and print the mapping scores for the alignments (The highest quality score from bwa-mem2 is 60)
samtools view input.bam | cut -f 5 | sort -n | uniq -c

Filter alignments
for file in *.bam; do filename="${file%%.*}"; samtools view --threads 6 -f 0x2 -bq 30 $file > ${filename}.sorted.filt.bam; done
for file in *.filt.bam; do filename="${file%%.*}"; samtools flagstat --threads 6 $file > ${filename}.sorted.filt.bam.stats; done

Parameter Description
-f 0x2 only retain alignments in which paired reads properly aligned
-q only include reads with a mapping quality of 30
-b output BAM

Count the number of mapped reads for each sample
cat *.stats | grep "mapped (" | awk '{print $1}' > reads_mapped_to_reference

Alignment stats

In [1]:
# Import csv file
import pandas as pd
align_stats = pd.read_csv("Jupyter_files/sequencing_alignment_stats.csv")

# Print all rows of the data
pd.set_option('display.max_rows', None)
align_stats
Out[1]:
Site Sample HQ reads MAPQ30 reads % Mapped
0 Armacao de Pera Arm01 30967760 22249151 71.8
1 Armacao de Pera Arm02 842630 603215 71.6
2 Bornalle Bor01 13418354 8789086 65.5
3 Bornalle Bor02 11140932 7626690 68.5
4 Bornalle Bor03 8346420 5858196 70.2
5 Bornalle Bor04 10039870 6857527 68.3
6 Bornalle Bor05 18298828 11678387 63.8
7 Bornalle Bor06 9614332 6357822 66.1
8 Bornalle Bor07 11545642 7983793 69.1
9 Bornalle Bor08 10092904 5894087 58.4
10 Bornalle Bor09 31727426 20435149 64.4
11 Bornalle Bor10 12482974 7827782 62.7
12 Bornalle Bor11 38737724 25097998 64.8
13 Bornalle Bor12 23189250 15591565 67.2
14 Falmouth, St Mawes Fal01 9200438 4564781 49.6
15 Falmouth, St Mawes Fal02 12142774 7930322 65.3
16 Falmouth, St Mawes Fal03 16 0 0.0
17 Falmouth, St Mawes Fal04 6762302 4017408 59.4
18 Falmouth, St Mawes Fal05 3116070 1806536 58.0
19 Falmouth, St Mawes Fal06 9380120 5376406 57.3
20 Falmouth, St Mawes Fal07 23904194 14059779 58.8
21 Falmouth, St Mawes Fal08 2916964 1977575 67.8
22 Falmouth, St Mawes Fal09 19737868 12306151 62.3
23 Falmouth, St Mawes Fal10 20999630 13029468 62.0
24 L. corallioides Lcor05 456912 20016 4.4
25 L. corallioides Lcor06 493628 18966 3.8
26 L. corallioides Lcor08 10910752 381651 3.5
27 L. corallioides Lcor09 18758540 741107 4.0
28 L. corallioides Lcor11 2551710 81520 3.2
29 The Manacles MCZ Man03 11857468 6514960 54.9
30 The Manacles MCZ Man04 9022876 3216899 35.7
31 The Manacles MCZ Man21 12338794 4784391 38.8
32 The Manacles MCZ Man24 10852916 6824967 62.9
33 The Manacles MCZ Man28 5215302 3099429 59.4
34 The Manacles MCZ Man44 8828842 5209956 59.0
35 Milford Haven Mil01 2652022 1688838 63.7
36 Morlaix Mor01 7473696 4505206 60.3
37 Morlaix Mor02 35256386 17615849 50.0
38 Morlaix Mor03 18371744 10979720 59.8
39 Morlaix Mor04 4455062 2924626 65.6
40 Morlaix Mor05 7244144 4114713 56.8
41 Morlaix Mor06 21307824 12989031 61.0
42 Morlaix Mor07 15663480 10130626 64.7
43 Morlaix Mor08 7871390 4777548 60.7
44 Morlaix Mor09 17799290 10460562 58.8
45 Morlaix Mor10 6 0 0.0
46 Morlaix Mor11 11722550 6360609 54.3
47 Morlaix Mor12 22273346 13893009 62.4
48 Norway Nor01 3308942 2187265 66.1
49 Illa de Ons Ons01 25172838 17424847 69.2
50 Illa de Ons Ons02 6294708 4386629 69.7
51 Illa de Ons Ons03 14200544 9573564 67.4
52 Illa de Ons Ons05 10624306 7556152 71.1
53 Illa de Ons Ons06 7461200 5016457 67.2
54 Illa de Ons Ons07 8475908 5772290 68.1
55 Illa de Ons Ons08 13862834 8806237 63.5
56 Illa de Ons Ons09 10731880 6879242 64.1
57 Illa de Ons Ons10 22299088 15518380 69.6
58 Illa de Ons Ons11 10674706 7221286 67.6
59 Illa de Ons Ons12 12679156 7398197 58.3
60 La Rochelle Roc01 17138538 11639841 67.9
61 La Rochelle Roc02 8938986 6020072 67.3
62 Scotland, Loch Sween Sco01 7293046 174389 2.4
63 Scotland, Loch Sween Sco03 12272232 238729 1.9
64 Scotland, Loch Sween Sco04 3188198 110132 3.5
65 Scotland, Loch Sween Sco05 1616992 103121 6.4
66 Scotland, Loch Sween Sco06 319896 21867 6.8
67 Scotland, Loch Sween Sco07 10711958 235201 2.2
68 Scotland, Loch Sween Sco09 4030604 154254 3.8
69 Scotland, Loch Sween Sco10 969756 34241 3.5
70 Scotland, Loch Sween Sco11 4800428 148687 3.1
71 Scotland, Loch Sween Sco12 13488646 285725 2.1
72 Scotland, Loch Sween Sco14 13747288 291532 2.1
73 Scotland, Loch Sween Sco15 8080054 194597 2.4
74 Trevignon Tre01 15730626 9943452 63.2
75 Trevignon Tre02 3577346 2106361 58.9
76 Trevignon Tre03 16959414 6391941 37.7
77 Trevignon Tre04 10912580 5813457 53.3
78 Trevignon Tre05 18633590 11266784 60.5
79 Trevignon Tre07 17115224 8556194 50.0
80 Trevignon Tre08 16022716 9931383 62.0
81 Trevignon Tre09 13382208 8712345 65.1
82 Trevignon Tre10 6517174 4447695 68.2
83 Zara Shoal, Strangford Lough Zar01 13347266 8943289 67.0
84 Zara Shoal, Strangford Lough Zar02 4707486 3171579 67.4
85 Zara Shoal, Strangford Lough Zar03 7939382 5427246 68.4
86 Zara Shoal, Strangford Lough Zar04 12634560 8804987 69.7
87 Zara Shoal, Strangford Lough Zar05 9523352 6630316 69.6
88 Zara Shoal, Strangford Lough Zar06 11094356 7648517 68.9
89 Zara Shoal, Strangford Lough Zar07 10398910 6695879 64.4
90 Zara Shoal, Strangford Lough Zar08 11268792 7696083 68.3
91 Zara Shoal, Strangford Lough Zar09 16 0 0.0
92 Zara Shoal, Strangford Lough Zar10 16661870 11365022 68.2
93 Zara Shoal, Strangford Lough Zar11 15908674 10996515 69.1
94 Zara Shoal, Strangford Lough Zar12 14396370 9980165 69.3
95 Total 1101098746 621175217 56.4

6.2 Call and filter variants

1. Call variants using bcftools

Create alignment list file
ls -1 ../12.bwa-mem2_alignments/*.filt.bam > alignment_list

Run bcftools mpileup and pipe the output to bcftools call
bcftools mpileup --threads 6 --annotate FORMAT/AD,FORMAT/DP,INFO/AD -Ou -f ../assembly_pcal/assembly_Mor02_final.fasta --bam-list alignment_list | bcftools call --threads 6 --variants-only --multiallelic-caller -Ov -o variant_calls.vcf

Create a file containing old vcf header and new vcf headers
paste <(cat alignment_list) <(ls -1 ../12.bwa-mem2_alignments/*.filt.bam | grep -o -P ".{0,5}.sorted" | sed 's/.sorted//g') > rename_samples
Edit file to add L in front of Lcor samples.

Rename vcf headers
bcftools reheader --samples rename_samples -o variant_calls.vcf.rename variant_calls.vcf

2. Call variants using bcftools with ploidy model

Create samples list file with ploidy
ls -1 ../raw_reads/*fastq.gz | awk '{print $0, "\t", "3"}' > ploidy_list.txt

Run bcftools mpileup and pipe the output to bcftools call
bcftools mpileup --threads 6 --annotate FORMAT/AD,FORMAT/DP,INFO/AD -Ou -f ../assembly_pcal/clean_maerl_Mor02_250.fa --bam-list alignment_list | bcftools call --threads 6 --variants-only --multiallelic-caller -Ov -o variant_calls.vcf --samples-file ploidy_list.txt

ISCA

Call variants using bcftools
module load BCFtools/1.9-foss-2018b
bcftools mpileup --threads 16 -Ou -f ../assembly_Mor02_final.fasta --bam-list alignments_list.txt | bcftools call --threads 16 -mv -Ov -o variant_calls.vcf
[Time: 14:13:01]

TEST: Call variants using bcftools with ploidy model
module load BCFtools/1.9-foss-2018b
bcftools mpileup --threads 16 --annotate FORMAT/AD,FORMAT/DP,INFO/AD -Ou -f ../../assembly_Mor02_final.fasta --bam-list ../alignments_list.txt | bcftools call --threads 16 --samples-file ploidy_list.txt --variants-only --multiallelic-caller -Ov -o variant_calls.vcf
[Time: ]

Convert bcf to vcf (if required)
bcftools view --output-type v variant_calls.bcf > variant_calls.vcf

Filter vcf using vcftools

Filter 1 (reduce dataset)
vcftools --vcf variant_calls.vcf --out filter1 --recode --recode-INFO-all --max-missing 0.70 --minQ 30

Filtering step Number of SNPs
Start XXX
--max-missing 0.70 --minQ 30 XXX

Filter 2 (remove individuals with missing data)

Assess individual missing data
vcftools --vcf filter1.recode.vcf --missing-indv --out missingdata
cat missingInd.imiss

Create a list of individuals with more than 20% missing data
awk '$5 > 0.20' missingdata.imiss | cut -f 1 > missingdata-greater20

Remove individuals from data set
vcftools --vcf filter1.recode.vcf --out filter2 --recode --recode-INFO-all --remove missingdata-greater20

Filter 3 (all filters)
vcftools --vcf filter2.recode.vcf --out filter3 --recode --recode-INFO-all --max-missing 0.97 --minDP 5 --maxDP 100 --min-alleles 2 --max-alleles 2 --remove-indels --mac 5

VCFtools filtering step Number of SNPs
--max-missing 0.97
--minDP 5
--maxDP 100
--min-alleles 2 --max-alleles 2
--remove-indels
--mac 5 XXX

Filter 4 (linkage disequilibrium)
vcftools --vcf filter3.recode.vcf --geno-r2 --min-r2 0.7

Tabular result:

CHR POS1 POS2 N_INDV R^2
tig_32046 6390 6399 66 1
tig_76165 173 1058 64 1
tig_76165 173 2675 65 1
tig_76165 1058 2675 64 0.84
tig_...n x x x x

Extract unique rows of POS1 (column 2) which represent one of the loci in LD to remove from the dataset and redirect only the CHR and POS1 columns to a text file.
cat out.geno.ld | awk '!seen[$2]++' | grep -v "^CHR" | cut -f 1,2 > loci_in_ld.txt
wc -l loci_in_ld.txt

Remove LD loci from data set
vcftools --vcf filter3.recode.vcf --out filter4 --recode --recode-INFO-all --exclude-positions loci_in_ld.txt

VCFtools filtering step Number of SNPs
--geno-r2 (r2 > 0.7) XX

Extract variants between P. calcareum and L. corallioides
module load VCFtools/0.1.16-foss-2018b-Perl-5.28.0
vcftools --vcf ../variant_calls.vcf --out ./hybrid1 --recode --recode-INFO-all --keep hybrid_ind_to_keep.txt
vcftools --recode --recode-INFO-all --vcf hybrid1.recode.vcf --out hybrid2 --minQ 30 --max-missing 0.90 --min-alleles 2 --max-alleles 2